
#797 Add support for variable-sized OCCURS when writing EBCDIC files #830

Merged
yruslan merged 6 commits into master from feature/797-add-support-for-variable-size-occurs on Mar 10, 2026

Conversation

@yruslan
Collaborator

@yruslan yruslan commented Mar 9, 2026

Closes #797

Summary by CodeRabbit

  • New Features

    • Writer supports OCCURS DEPENDING ON for dynamic arrays with improved RDW handling
    • Added a lightweight holder for dependee field offsets and primitives read/write helpers
    • Validation API expanded to validate copybook/source configurations for reading and writing
  • Bug Fixes

    • String rendering now shows actual field names (no camelCase conversions)
    • Validation produces clearer errors/warnings for unsupported write options
  • Tests

    • New tests for OCCURS, DEPENDING ON, variable-size scenarios, and parameter validation
  • Chores

    • Minor CI message formatting and documentation updates

@coderabbitai
Contributor

coderabbitai bot commented Mar 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Refactors writer AST and RDD processing to support OCCURS and OCCURS DEPENDING ON (variable-size arrays) via dependee tracking; moves primitive field extraction to Copybook companion; tightens writer parameter validation and adds tests and small CI/build edits.
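For context, OCCURS DEPENDING ON declares an array whose element count is read from another field (the "dependee"). A minimal illustrative copybook (not taken from this PR's test resources) looks like:

```cobol
       01  RECORD.
           05  COUNT-F        PIC 9(2).
           05  ITEM           OCCURS 0 TO 10 TIMES
                              DEPENDING ON COUNT-F.
               10  ITEM-NAME  PIC X(4).
```

With variable-size OCCURS enabled, the writer can emit only as many ITEM elements as the dependee indicates, rather than padding every record to the maximum of 10.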

Changes

• AST String Rendering
  Files: cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala, cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Statement.scala
  toString now uses the raw name and no longer camel-cases redefined names; removed the camelCased field and camelCase() helper.
• Copybook helpers & usage
  Files: cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala, .../reader/.../*.scala, .../parser/extract/BinaryExtractorSpec.scala
  Added Copybook.getPrimitiveField and alias extractPrimitiveField; added Copybook.setPrimitiveField; replaced instance calls (copybook.extractPrimitiveField) with companion/object calls across readers and tests.
• Writer dependee model
  Files: spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/DependingOnField.scala, .../WriterAst.scala
  New DependingOnField case class; new AST node PrimitiveDependeeField; PrimitiveArray/GroupArray accept optional dependingOn references; added WriterAst docs.
• AST construction & writing
  Files: spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala
  Major refactor to propagate a dependeeMap through builder methods; added getAst(copybook) normalization; processRDD/writeToBytes updated for variable-length OCCURS and RDW length adjustments; handling for dependee fields wired through generation and writing.
• Parameter validation & entrypoint
  Files: spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala, spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/CobolStreamer.scala
  Added a logger and a validateOrThrow(parameters: Map[String, String], hadoopConf) overload; new validateParametersForWriting(readerParameters) enforces disallowed write options (e.g., occurs_mappings, multi-segment); the Deserializer now collects parameters into a Map and validates before reader init.
• Writer tests
  Files: spark-cobol/src/test/scala/.../NestedWriterSuite.scala, .../VariableLengthEbcdicWriterSuite.scala, .../CobolParametersValidatorSuite.scala
  Added tests and copybooks for OCCURS and OCCURS DEPENDING ON (including variable-length occurs); updated writer tests and added validator unit tests for write-time constraints and error messages.
• Indexing / iterators
  Files: cobol-parser/src/main/scala/.../IndexGenerator.scala, .../FixedLenNestedRowIterator.scala, .../VRLRecordReader.scala
  Refactored segment-id helpers to use Copybook.getPrimitiveField; simplified signatures in index/segment checks; adjusted index generation to avoid duplicate entries.
• CI & build manifest
  Files: .github/workflows/jacoco_check.yml, build.sbt
  Minor string-quoting changes in CI messages and a single-line build manifest tweak.
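The table above mentions a new DependingOnField case class and a dependeeMap, but not their shapes. A hypothetical sketch of such a holder (names and fields are assumptions, not the PR's actual definition) might look like:

```scala
import scala.collection.mutable

// Hypothetical sketch: one entry per OCCURS DEPENDING ON dependee,
// pairing the dependee's name with its byte offset in the current record.
// `baseOffset` is mutable because it is updated as each row is written.
final case class DependingOnFieldSketch(
  dependeeName: String,   // e.g. "COUNT-F"
  var baseOffset: Int     // offset of the dependee within the record bytes
)

// Builder methods would register dependees while walking the copybook AST:
val dependeeMap = mutable.Map.empty[String, DependingOnFieldSketch]
dependeeMap("COUNT-F") = DependingOnFieldSketch("COUNT-F", baseOffset = 0)
```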

Sequence Diagram

sequenceDiagram
    participant User
    participant Writer
    participant Combiner
    participant Builder
    participant DepMap
    participant Processor

    User->>Writer: combine(rdd, copybook, schema)
    Writer->>Combiner: constructWriterAst(copybook, schema)
    Combiner->>DepMap: create empty dependeeMap
    Combiner->>Builder: buildGroupField(..., dependeeMap)
    Builder->>Builder: buildPrimitiveNode(..., dependeeMap)
    alt field has DEPENDING ON
        Builder->>DepMap: register DependingOnField(name -> spec)
        Builder-->>Combiner: emit PrimitiveDependeeField / arrays with dependingOn
    end
    Combiner-->>Writer: return root WriterAst
    Writer->>Processor: processRDD(..., variableSizeOccurs)
    Processor->>Processor: iterate rows, call writeToBytes(ast,..., variableLengthOccurs)
    rect rgba(100,200,100,0.5)
        Processor->>DepMap: consult dependeeMap for dependee values
        Processor->>Processor: compute dynamic OCCURS sizes, write arrays accordingly
    end
    Processor-->>Writer: RDD[Array[Byte]]
    Writer-->>User: output RDD
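The "compute dynamic OCCURS sizes, write arrays accordingly" step above can be sketched roughly as follows; all names are illustrative, not the PR's actual API, and a raw 2-byte binary count stands in for whatever PIC clause encodes the real dependee:

```scala
// Illustrative sketch only: given a row's actual array elements, write
// the element count into a 2-byte big-endian dependee slot, then the
// elements themselves, producing a record sized to the actual data.
def writeVariableOccurs(items: Seq[Array[Byte]], elementSize: Int): Array[Byte] = {
  require(items.forall(_.length == elementSize), "all elements must have the declared size")
  val n = items.length
  val out = new Array[Byte](2 + n * elementSize)
  out(0) = ((n >> 8) & 0xFF).toByte   // dependee value, high byte
  out(1) = (n & 0xFF).toByte          // dependee value, low byte
  items.zipWithIndex.foreach { case (el, i) =>
    System.arraycopy(el, 0, out, 2 + i * elementSize, elementSize)
  }
  out
}
```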

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I hop through copybooks, names honest and plain,
I map the fields that tell arrays how to grow,
I tuck their specs in a map, ready for the main,
I stitch bytes in order, row by row,
A rabbit’s nibble, and the writer sings hello.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 inconclusive)

• Linked Issues check (❓ Inconclusive): The PR addresses two of three requested features from issue #797: variable-sized OCCURS support and OCCURS DEPENDING ON handling. VB file support and CP037 code page enhancements are not fully implemented in this changeset.
  Resolution: Verify whether support for VB files and CP037 code page enhancements is intended in this PR or planned for separate issues; clarify scope alignment with issue #797 requirements.
• Out of Scope Changes check (❓ Inconclusive): The PR includes a refactoring that promotes field-extraction methods to the Copybook companion object, which supports the variable-sized OCCURS feature but goes beyond the stated issue requirements.
  Resolution: Confirm whether the Copybook API refactoring is a necessary prerequisite for the OCCURS feature or should be scoped separately; document the rationale for this structural change.

✅ Passed checks (3 passed)

• Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.
• Title check (✅ Passed): The title accurately describes the main feature: adding variable-sized OCCURS support when writing EBCDIC files, the primary focus of this changeset.
• Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@github-actions

github-actions bot commented Mar 9, 2026

JaCoCo code coverage report - 'cobol-parser'

Overall Project 91.4% -0.12% 🍏
Files changed 54.37% 🍏

File Coverage
VRLRecordReader.scala 93.71% 🍏
IndexGenerator.scala 89.24% 🍏
Copybook.scala 83.16% -0.18% 🍏
FixedLenNestedRowIterator.scala 73.42% 🍏
Primitive.scala 71.43% -4.8%
FixedWithRecordLengthExprRawRecordExtractor.scala 71.12% 🍏
Statement.scala 51.87% -17.65%

@github-actions

github-actions bot commented Mar 9, 2026

JaCoCo code coverage report - 'spark-cobol'

Overall Project 83.77% -6.59% 🍏
Files changed 35.63%

File Coverage
DependingOnField.scala 100%
WriterAst.scala 93.51%
NestedRecordCombiner.scala 88.37% -1.86% 🍏
CobolParametersValidator.scala 85.77% -0.82% 🍏
CobolStreamer.scala 0% -24.89%

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala (1)

119-145: ⚠️ Potential issue | 🟠 Major

Reject variable_size_occurs for fixed-length writes.

Now that variable_size_occurs is allowed for writing, NestedRecordCombiner.processRDD() can emit shorter byte arrays for record_format = F. Without RDW or any other framing, that turns the output into an undecodable flat file as soon as one record shrinks. Please fail this combination here, or keep fixed-length writes untrimmed.

⚙️ Proposed validation
   if (readerParameters.recordFormat != RecordFormat.FixedLength && readerParameters.recordFormat != RecordFormat.VariableLength) {
     issues += s"Only '${RecordFormat.FixedLength}' and '${RecordFormat.VariableLength}' values for 'record_format' are supported for writing, " +
       s"provided value: '${readerParameters.recordFormat}'"
   }
+
+  if (readerParameters.variableSizeOccurs &&
+    readerParameters.recordFormat != RecordFormat.VariableLength) {
+    issues += "'variable_size_occurs' is supported for writing only when 'record_format' is 'V'"
+  }
 
   if (readerParameters.occursMappings.nonEmpty) {
     issues += "OCCURS mapping option ('occurs_mappings') is not supported for writing"
   }
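The failure mode described above is easy to see with a sketch: a fixed-length reader advances by a constant stride, so a single trimmed record shifts every boundary after it. Sizes below are made up for illustration:

```scala
// A fixed-length file is decoded by slicing at multiples of recordSize.
val recordSize = 8
val intact  = Array.fill[Byte](3 * recordSize)(0)   // 3 records, decodable
val trimmed = intact.take(2 * recordSize + 5)       // one record shrank by 3 bytes

def recordAt(file: Array[Byte], i: Int): Array[Byte] =
  file.slice(i * recordSize, (i + 1) * recordSize)

// In the trimmed file the last record is short, and any record written
// after it would start at the wrong offset: boundaries are lost.
assert(recordAt(intact, 2).length == recordSize)
assert(recordAt(trimmed, 2).length < recordSize)
```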
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala`
around lines 119 - 145, The validator must reject the combination of
fixed-length output and variable-size OCCURS; update
validateParametersForWriting (in CobolParametersValidator) to add a check that
if readerParameters.recordFormat == RecordFormat.FixedLength and the
variable-size-occurs option is enabled (readerParameters.variableSizeOccurs /
variable_size_occurs flag on ReaderParameters), append an error like
"variable_size_occurs is not supported for fixed-length writes" to issues so the
IllegalArgumentException is thrown; this ensures fixed-length writes remain
untrimmed and prevents emitting shorter byte arrays.
🧹 Nitpick comments (1)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/VariableLengthEbcdicWriterSuite.scala (1)

134-160: Add a VB rejection regression here too.

This negative-path test currently proves FB is rejected, but not VB. Since issue #797 explicitly mentions VB, a dedicated record_format = "VB" assertion would make the intended writer contract much clearer and protect it from regressions.

Based on learnings, variable-block format writing will not be implemented in the writer; only VariableLength uses RDW headers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/VariableLengthEbcdicWriterSuite.scala`
around lines 134 - 160, The test in VariableLengthEbcdicWriterSuite ("throw an
exception on unexpected output record format") only asserts that "FB" is
rejected but misses asserting that "VB" is rejected; add a second negative-path
check (or extend the existing intercept) that writes with
.option("record_format", "VB") using the same df/path setup and asserts the
thrown IllegalArgumentException message includes the same "Only 'F' and 'V'
values for 'record_format' are supported for writing, provided value: 'VB'" text
so VB regressions are caught (refer to the test block variables df, path and the
intercept/exception handling used in the current test).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`:
- Around line 418-426: The null-array branches in NestedRecordCombiner are
leaving DEPENDING ON fields as raw zeros in fixed-length mode because
Copybook.setPrimitiveField is only invoked when variableLengthOccurs is true;
update both null branches (around the blocks that now return 0 and
cobolField.binaryProperties.actualSize, also the similar case at lines 459-466)
to always call Copybook.setPrimitiveField(spec.cobolField, ar, 0,
fieldStartOffsetOverride = spec.baseOffset) for each spec in dependingOn before
returning, so the dependee PrimitiveDependeeField gets an encoded COBOL zero
even when the array is null and variableLengthOccurs is false.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b88d503c-8012-4362-bbf1-6c0ad43aa41d

📥 Commits

Reviewing files that changed from the base of the PR and between a76495d and 94beb4b.

📒 Files selected for processing (8)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Primitive.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/Statement.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/DependingOnField.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/WriterAst.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/VariableLengthEbcdicWriterSuite.scala

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala (1)

370-373: Mutable state in shared AST node is safe but fragile.

DependingOnField.baseOffset is mutated at line 372 for every row within the same partition. This works correctly because:

  1. Rows are processed sequentially within mapPartitions
  2. Each row gets a fresh byte array
  3. baseOffset is always updated before being read in the subsequent dependingOn.foreach calls

However, this design is fragile—if the processing model ever changes to concurrent row handling, this will silently corrupt data. Consider adding a brief comment explaining why this mutation is safe.

📝 Suggested documentation
       // ── Primitive which has an OCCURS DEPENDS ON ─────────────────────────────
       case PrimitiveDependeeField(spec) =>
+        // NOTE: baseOffset is mutated here for each row. This is safe because rows
+        // are processed sequentially within mapPartitions, and the offset is always
+        // updated before being read in subsequent array-element writes.
         spec.baseOffset = currentOffset
         spec.cobolField.binaryProperties.actualSize
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`
around lines 370 - 373, The mutation of DependingOnField.baseOffset inside the
PrimitiveDependeeField match is fragile; add a concise comment near the
assignment (spec.baseOffset = currentOffset) explaining that this is safe
because rows are processed sequentially within mapPartitions, each row uses a
fresh byte array, and baseOffset is always updated before any
dependingOn.foreach reads it, and note that concurrent processing would break
this invariant; reference PrimitiveDependeeField, DependingOnField.baseOffset,
mapPartitions, and dependingOn.foreach in the comment so future maintainers
understand the assumption.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`:
- Line 281: Fix the comment typo inside NestedRecordCombiner.scala near the
comment above the dependee handling: change "defines" to "defined" in the
comment "// Dependee fields need not to be defines in Spark schema." so it reads
"// Dependee fields need not to be defined in Spark schema." to correct the
spelling in the NestedRecordCombiner class.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7883dcea-55d1-4916-8681-7c58385e6b5c

📥 Commits

Reviewing files that changed from the base of the PR and between e08fbb5 and bfa90d6.

📒 Files selected for processing (3)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidatorSuite.scala

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala (1)

467-476: Reduce coupling to AST layout and exact exception wording.

This test depends on children(5) and the full message with parser line numbers, so harmless AST or wording refactors will break it even if duplicate-dependee validation still works. Prefer selecting the second CNT1 by name and asserting only the stable part of the error.

♻️ Suggested cleanup
-      val ast = parsedCopybook.ast
-      val children = ast.children.head.asInstanceOf[Group].children
-      val cnt2 = children(5).asInstanceOf[Primitive].withUpdatedIsDependee(true)
-      children(5) = cnt2
+      val record = parsedCopybook.ast.children.head.asInstanceOf[Group]
+      val cnt1Indexes = record.children.zipWithIndex.collect {
+        case (p: Primitive, idx) if p.name.equalsIgnoreCase("CNT1") => idx
+      }
+      assert(cnt1Indexes.size == 2)
+      val secondCnt1Index = cnt1Indexes(1)
+      record.children(secondCnt1Index) =
+        record.children(secondCnt1Index).asInstanceOf[Primitive].withUpdatedIsDependee(true)

       val ex = intercept[IllegalArgumentException] {
         NestedRecordCombiner.constructWriterAst(parsedCopybook, df.schema)
       }

-      assert(ex.getMessage == "Duplicate field name 'CNT1' found in copybook. Field names must be unique (case-insensitive) when OCCURS DEPENDING ON is used. Already found a dependee field with the same name at line 4, current field line number: 10.")
+      assert(ex.getMessage.contains("Duplicate field name 'CNT1' found in copybook."))
+      assert(ex.getMessage.contains("Field names must be unique (case-insensitive) when OCCURS DEPENDING ON is used."))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala`
around lines 467 - 476, The test in NestedWriterSuite is brittle because it
relies on AST index children(5) and exact exception text; update the test so it
finds the second CNT1 node by name (e.g., walk parsedCopybook.ast to locate the
Primitive node whose name equals "CNT1" and then call withUpdatedIsDependee on
that node) instead of using children(5), and change the assertion on the thrown
IllegalArgumentException from exact string equality to a contains check for the
stable fragment (for example verify ex.getMessage contains "Duplicate field name
'CNT1'") while keeping the call to
NestedRecordCombiner.constructWriterAst(parsedCopybook, df.schema).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala`:
- Around line 153-156: The validator currently only logs a warning when
readerParameters.variableSizeOccurs is true with readerParameters.recordFormat
== RecordFormat.FixedLength; change this to reject the configuration by
throwing/returning a validation error in CobolParametersValidator (instead of
log.warn) when variableSizeOccurs is set with RecordFormat.FixedLength,
referencing the same flags (variableSizeOccurs and recordFormat) and
RecordFormat.FixedLength to locate the check; include a clear error message
stating that variable-size occurs cannot be used with fixed-length writes
because RDW framing is only added for RecordFormat.VariableLength (see
NestedRecordCombiner handling), so validation must fail.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/CobolStreamer.scala`:
- Around line 42-47: The code currently calls
ssc.sparkContext.getConf.get(PARAM_COPYBOOK_PATH) and get(PARAM_SOURCE_PATH)
which throw NoSuchElementException before
CobolParametersValidator.validateOrThrow runs; change the parameter extraction
in CobolStreamer: replace direct get(...) calls with getOption(...) and supply a
safe default (e.g., .getOrElse("") ) when building the parameters Map for
PARAM_COPYBOOK_PATH and PARAM_SOURCE_PATH so that
CobolParametersValidator.validateOrThrow(parameters,
ssc.sparkContext.hadoopConfiguration) can run and produce its intended
validation errors instead of a raw exception.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 889a3459-b227-4f39-aa31-c97cb5aed499

📥 Commits

Reviewing files that changed from the base of the PR and between bfa90d6 and 01da263.

📒 Files selected for processing (3)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/CobolStreamer.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala

Comment on lines +153 to +156
    if (readerParameters.variableSizeOccurs &&
      readerParameters.recordFormat == RecordFormat.FixedLength) {
      log.warn("Option 'variable_size_occurs=true' is used with 'record_format=F' which means records can have variable length. It is highly recommended to use 'record_format=V' instead.")
    }
Contributor

⚠️ Potential issue | 🟠 Major

Reject variable_size_occurs for fixed-length writes.

This combination can emit variable-sized records while spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala:45-76 only adds RDW framing for RecordFormat.VariableLength. With record_format=F, downstream consumers lose unambiguous record boundaries, so this should fail validation instead of warning.
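For reference, the RDW framing mentioned above is a 4-byte prefix whose first two bytes carry the record length (big-endian here) and whose last two bytes are zero. A minimal sketch, assuming a payload-only length convention (conventions differ on whether the 4 header bytes are counted in the length):

```scala
def withRdw(payload: Array[Byte]): Array[Byte] = {
  val len = payload.length
  // A 2-byte length field caps a single record at 65535 bytes.
  require(len <= 0xFFFF, s"record of $len bytes does not fit the RDW length field")
  val header = Array[Byte](((len >> 8) & 0xFF).toByte, (len & 0xFF).toByte, 0, 0)
  header ++ payload
}
```

With this prefix a reader recovers record boundaries from each header, which is why variable-sized records are safe only under record_format=V.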

🛑 Suggested validation change
-    if (readerParameters.variableSizeOccurs &&
-      readerParameters.recordFormat == RecordFormat.FixedLength) {
-      log.warn("Option 'variable_size_occurs=true' is used with 'record_format=F' which means records can have variable length. It is highly recommended to use 'record_format=V' instead.")
-    }
+    if (readerParameters.variableSizeOccurs &&
+      readerParameters.recordFormat == RecordFormat.FixedLength) {
+      issues += "Option 'variable_size_occurs=true' requires 'record_format=V' for writing"
+    }

Comment on lines +42 to +47
    val parameters: Map[String, String] = Map[String, String](
      PARAM_COPYBOOK_PATH -> ssc.sparkContext.getConf.get(PARAM_COPYBOOK_PATH),
      PARAM_SOURCE_PATH -> ssc.sparkContext.getConf.get(PARAM_SOURCE_PATH)
    )

    CobolParametersValidator.validateOrThrow(parameters, ssc.sparkContext.hadoopConfiguration)
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In Apache Spark's Scala API, what does SparkConf.get(key) do when the key is missing, and how does that differ from SparkConf.getOption(key)?

💡 Result:

  • SparkConf.get(key) returns the String value if set, but if the key is missing it throws java.util.NoSuchElementException(key). [1]
  • SparkConf.getOption(key) returns an Option[String]: Some(value) if present, otherwise None (no exception). [1]

Sources:
[1] Apache Spark source (SparkConf.scala, v3.5.6) (github.com)



🏁 Script executed:

# Locate and inspect CobolStreamer.scala around lines 42-47
fd "CobolStreamer.scala" --type f

Repository: AbsaOSS/cobrix

Length of output: 1855


🏁 Script executed:

# Inspect CobolParametersValidator to understand error message design
fd "CobolParametersValidator.scala" --type f

Repository: AbsaOSS/cobrix

Length of output: 163


🏁 Script executed:

# Find CobolStreamer.scala
fd "CobolStreamer.scala" --path "*/spark-cobol/*"

Repository: AbsaOSS/cobrix

Length of output: 273


🏁 Script executed:

# Read CobolStreamer.scala around lines 42-47
find . -path "*/spark-cobol/*" -name "CobolStreamer.scala" -exec cat -n {} \; | sed -n '35,55p'

Repository: AbsaOSS/cobrix

Length of output: 1236


🏁 Script executed:

# Read CobolParametersValidator to understand error messages
cat -n spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/parameters/CobolParametersValidator.scala | head -150

Repository: AbsaOSS/cobrix

Length of output: 8759


Don't call SparkConf.get() before validation.

These lookups throw NoSuchElementException before CobolParametersValidator.validateOrThrow() runs (line 47). Users see a raw config exception instead of the validator's intended error messages for missing required parameters.

Suggested fix
   implicit class Deserializer(`@transient` val ssc: StreamingContext) extends Serializable {
-    val parameters: Map[String, String] = Map[String, String](
-      PARAM_COPYBOOK_PATH -> ssc.sparkContext.getConf.get(PARAM_COPYBOOK_PATH),
-      PARAM_SOURCE_PATH -> ssc.sparkContext.getConf.get(PARAM_SOURCE_PATH)
-    )
+    private val conf = ssc.sparkContext.getConf
+    val parameters: Map[String, String] =
+      Seq(PARAM_COPYBOOK_PATH, PARAM_SOURCE_PATH)
+        .flatMap(key => conf.getOption(key).map(value => key -> value))
+        .toMap

     CobolParametersValidator.validateOrThrow(parameters, ssc.sparkContext.hadoopConfiguration)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/CobolStreamer.scala`
around lines 42 - 47, The code currently calls
ssc.sparkContext.getConf.get(PARAM_COPYBOOK_PATH) and get(PARAM_SOURCE_PATH)
which throw NoSuchElementException before
CobolParametersValidator.validateOrThrow runs; change the parameter extraction
in CobolStreamer: replace direct get(...) calls with getOption(...) and supply a
safe default (e.g., .getOrElse("") ) when building the parameters Map for
PARAM_COPYBOOK_PATH and PARAM_SOURCE_PATH so that
CobolParametersValidator.validateOrThrow(parameters,
ssc.sparkContext.hadoopConfiguration) can run and produce its intended
validation errors instead of a raw exception.
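The getOption-based pattern from the suggested fix can be sketched without a Spark dependency; `ParamsSketch`, `collectPresent`, and the key names below are illustrative, not part of Cobrix:

```scala
object ParamsSketch {
  // Build a parameters map containing only the keys that are actually set,
  // mirroring the conf.getOption(...) + flatMap shape of the suggested fix.
  def collectPresent(conf: Map[String, String], keys: Seq[String]): Map[String, String] =
    keys.flatMap(key => conf.get(key).map(value => key -> value)).toMap

  def main(args: Array[String]): Unit = {
    // "source" is deliberately absent: no exception is thrown, the key is
    // simply missing, so a downstream validator can report it properly.
    val conf = Map("copybook" -> "/copybooks/example.cpy")
    println(collectPresent(conf, Seq("copybook", "source")))
  }
}
```

This is exactly why the fix routes missing keys to the validator instead of letting `SparkConf.get` throw first.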

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala (1)

59-75: ⚠️ Potential issue | 🟠 Major

Validate the RDW length from the written row, not the copybook maximum.

This pre-check uses cobolSchema.getRecordSize, so big-endian RDW rejects any copybook whose maximum size exceeds 65535 even when variableSizeOccurs makes each emitted row much smaller. The real length is only known at Line 165, and that path currently packs the low bytes without a per-row bounds check.

🧩 Suggested fix
-    val recordLengthLong = cobolSchema.getRecordSize.toLong + adjustment1.toLong + adjustment2.toLong
-    if (recordLengthLong < 0) {
-      throw new IllegalArgumentException(
-        s"Invalid RDW length $recordLengthLong. Check 'is_rdw_part_of_record_length' and 'rdw_adjustment'."
-      )
-    }
-    if (isRdwBigEndian && recordLengthLong > 0xFFFFL) {
-      throw new IllegalArgumentException(
-        s"RDW length $recordLengthLong exceeds 65535 and cannot be encoded in big-endian mode."
-      )
-    }
-    if (!isRdwBigEndian && recordLengthLong > Int.MaxValue.toLong) {
-      throw new IllegalArgumentException(
-        s"RDW length $recordLengthLong exceeds ${Int.MaxValue} and cannot be encoded safely."
-      )
-    }
+    if (!readerParameters.variableSizeOccurs || !hasRdw) {
+      val maxRecordLengthLong = cobolSchema.getRecordSize.toLong + adjustment1.toLong + adjustment2.toLong
+      if (maxRecordLengthLong < 0) {
+        throw new IllegalArgumentException(
+          s"Invalid RDW length $maxRecordLengthLong. Check 'is_rdw_part_of_record_length' and 'rdw_adjustment'."
+        )
+      }
+      if (isRdwBigEndian && maxRecordLengthLong > 0xFFFFL) {
+        throw new IllegalArgumentException(
+          s"RDW length $maxRecordLengthLong exceeds 65535 and cannot be encoded in big-endian mode."
+        )
+      }
+      if (!isRdwBigEndian && maxRecordLengthLong > Int.MaxValue.toLong) {
+        throw new IllegalArgumentException(
+          s"RDW length $maxRecordLengthLong exceeds ${Int.MaxValue} and cannot be encoded safely."
+        )
+      }
+    }
-          val recordLengthToWriteToRDW = bytesWritten + recordLengthAdj
+          val recordLengthToWriteToRDW = bytesWritten.toLong + recordLengthAdj.toLong
+          if (recordLengthToWriteToRDW < 0) {
+            throw new IllegalArgumentException(
+              s"Invalid RDW length $recordLengthToWriteToRDW for row. Check 'is_rdw_part_of_record_length' and 'rdw_adjustment'."
+            )
+          }
+          if (isRdwBigEndian && recordLengthToWriteToRDW > 0xFFFFL) {
+            throw new IllegalArgumentException(
+              s"RDW length $recordLengthToWriteToRDW exceeds 65535 and cannot be encoded in big-endian mode."
+            )
+          }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`
around lines 59 - 75, The current pre-check computes recordLengthLong from
cobolSchema.getRecordSize and rejects copybooks whose maximum size exceeds
limits; instead, remove or bypass these global checks in NestedRecordCombiner
and perform per-row RDW length validation inside processRDD where the actual
written row length is known (the code path around the RDW packing at line ~165).
Specifically, stop using cobolSchema.getRecordSize for big-endian/int bounds
when readerParameters.variableSizeOccurs may make rows smaller, and add checks
that compute the RDW length from the actual row bytes before encoding (validate
>=0, <=0xFFFF for isRdwBigEndian, <=Int.MaxValue otherwise) and throw
IllegalArgumentException with clear messages if a per-row length is out of
range; keep processRDD, cobolSchema.copybook, recordLengthLong, isRdwBigEndian,
variableSizeOccurs names to locate the changes.
🧹 Nitpick comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala (1)

147-187: Please add a regression test for record_format = V with variable_size_occurs = true.

The existing RDW coverage in spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/NestedWriterSuite.scala:133-156 keeps variable_size_occurs at its default, so this new path is still unpinned. A case where the copybook max size is large but the row is short would lock down both the RDW length calculation and the final copyOf() behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`
around lines 147 - 187, Add a regression test in NestedWriterSuite that covers
record_format = "V" with variable_size_occurs = true: construct a copybook with
a large max record size but produce a short Row so processRDD/writeToBytes takes
the path where hasRdw=true and variableSizeOccurs=true; verify the RDW header
bytes (both big-endian and little-endian cases if supported) reflect
bytesWritten + recordLengthAdj and assert the output array length equals
bytesWritten + startOffset (i.e., the trimmed copy produced by
java.util.Arrays.copyOf), thereby locking down RDW length calculation and the
copyOf behavior referenced by writeToBytes/processRDD.
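The copyOf trimming this regression test would pin down can be shown in isolation; the sizes below are illustrative, not taken from the project:

```scala
import java.util.Arrays

object TrimSketch {
  def trim(buffer: Array[Byte], startOffset: Int, bytesWritten: Int): Array[Byte] =
    // Keep only the RDW header plus the bytes actually written for this row,
    // discarding the unused tail reserved for the copybook maximum size.
    Arrays.copyOf(buffer, startOffset + bytesWritten)

  def main(args: Array[String]): Unit = {
    val buffer = new Array[Byte](4 + 100) // 4-byte RDW + copybook max of 100
    println(trim(buffer, startOffset = 4, bytesWritten = 40).length) // 44
  }
}
```

Asserting the output length equals `startOffset + bytesWritten` is the core of the suggested assertion.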
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala`:
- Around line 183-187: The current logic in NestedRecordCombiner uses
java.util.Arrays.copyOf whenever variableSizeOccurs is true, which truncates the
buffer even when hasRdw is false and removes record framing; change the
condition so copyOf is only used when variableSizeOccurs is true AND hasRdw is
true; if hasRdw is false, return the original full buffer (ar) regardless of
variableSizeOccurs, or alternatively validate earlier and throw/reject when
variable_size_occurs is set but record_format is not V. Update the branch that
checks variableSizeOccurs/recordSize/bytesWritten/startOffset to reference
hasRdw before calling copyOf.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8862b6e6-407a-408d-b37f-05c4868108c9

📥 Commits

Reviewing files that changed from the base of the PR and between 01da263 and e7bece4.

📒 Files selected for processing (7)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/Copybook.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/raw/FixedWithRecordLengthExprRawRecordExtractor.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/index/IndexGenerator.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/FixedLenNestedRowIterator.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/VRLRecordReader.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/extract/BinaryExtractorSpec.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/NestedRecordCombiner.scala

@yruslan yruslan merged commit 4611dba into master Mar 10, 2026
6 of 7 checks passed
@yruslan yruslan deleted the feature/797-add-support-for-variable-size-occurs branch March 10, 2026 08:25


Development

Successfully merging this pull request may close these issues.

ASCII to EBCDIC support with OCCURS and OCCURS+DEPENDING ON

1 participant